A new Hedging algorithm and its application to inferring latent random variables

Yoav Freund and Daniel Hsu
{yfreund,djhsu}@cs.ucsd.edu

June 30, 2008 

Abstract 

We present a new online learning algorithm for cumulative discounted gain. This learning algorithm 
does not use exponential weights on the experts. Instead, it uses a weighting scheme that depends 
on the regret of the master algorithm relative to the experts. In particular, experts whose discounted 
cumulative gain is smaller (worse) than that of the master algorithm receive zero weight. We also sketch 
how a regret-based algorithm can be used as an alternative to Bayesian averaging in the context of 
inferring latent random variables. 

1 Introduction 

We study a variation on the online allocation problem presented by Freund and Schapire in [FS97]. Our
problem varies from the original in that we use discounted cumulative loss instead of regular cumulative loss. 
Specifically, we consider the following iterative game between a hedger and Nature. 

In this setting, there are $N$ actions (e.g. strategies, experts) indexed by $i$. The game between the hedger
and Nature proceeds in iterations $j = 0, 1, 2, \ldots$. In the $j$th iteration:

1. The hedger chooses a distribution $\{p_i^j\}_{i=1}^N$ over the actions, where $p_i^j \ge 0$ and $\sum_{i=1}^N p_i^j = 1$.

2. Nature associates a gain $g_i^j \in [-1, 1]$ with action $i$.

3. The gain of the hedger is $g_A^j = \sum_{i=1}^N p_i^j g_i^j$.

We define the discounted total gain as follows. The initial total gain is zero: $G_i^0 = 0$. The total gain for action
$i$ at the start of iteration $j + 1$ is defined inductively as

$$G_i^{j+1} = (1-\alpha)\, G_i^j + g_i^j$$

for some fixed discount factor $\alpha > 0$. The discounted total gain of the hedger is similarly defined:

$$G_A^0 = 0, \qquad G_A^{j+1} = (1-\alpha)\, G_A^j + g_A^j.$$

We define the regret of the hedger with respect to action $i$ at the start of iteration $j$ as

$$R_i^j = G_i^j - G_A^j.$$

It is easy to see that the regret obeys the following recursion:

$$R_i^0 = 0, \qquad R_i^{j+1} = (1-\alpha)\, R_i^j + g_i^j - g_A^j.$$

Our goal is to find a hedging algorithm for which we can show a small uniform upper bound on the regret,
i.e. a small positive real number $B(\alpha)$ such that $R_i^j \le B(\alpha)$ for all choices of Nature, all $i$ and all $j$.
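The regret recursion can be checked numerically. The following is a minimal Python sketch, not part of the original paper: the function name `discounted_regrets`, the uniform hedger, and the random gain sequence are all illustrative assumptions. It tracks the discounted gains of each action and of the hedger, and confirms that their difference matches the regret computed by its own recursion.

```python
import random


def discounted_regrets(gains, dists, alpha):
    """Track discounted gains and regrets for a hedging game.

    gains[j][i] is Nature's gain for action i in iteration j (in [-1, 1]);
    dists[j][i] is the hedger's probability for action i in iteration j.
    Returns the final (G, G_A, R), where R is updated by the regret recursion.
    """
    n = len(gains[0])
    G = [0.0] * n      # discounted total gain of each action
    G_A = 0.0          # discounted total gain of the hedger
    R = [0.0] * n      # regret, updated by its own recursion
    for g, p in zip(gains, dists):
        g_A = sum(pi * gi for pi, gi in zip(p, g))   # hedger's gain
        R = [(1 - alpha) * r + gi - g_A for r, gi in zip(R, g)]
        G = [(1 - alpha) * Gi + gi for Gi, gi in zip(G, g)]
        G_A = (1 - alpha) * G_A + g_A
    return G, G_A, R


# Toy run: 3 actions, uniform hedger, random gains (all illustrative).
rng = random.Random(0)
gains = [[rng.uniform(-1, 1) for _ in range(3)] for _ in range(50)]
dists = [[1 / 3] * 3 for _ in range(50)]
G, G_A, R = discounted_regrets(gains, dists, alpha=0.1)
# The recursion and the definition R_i = G_i - G_A agree up to rounding.
print(max(abs(r - (Gi - G_A)) for r, Gi in zip(R, G)))
```

Tracking the regret directly, rather than the two gains separately, is what the later continuous-time treatment relies on.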
Our new hedging algorithm, which we call NormalHedge, uses the following weighting:

$$w_i^j = \begin{cases} \sqrt{\alpha}\, R_i^j \exp\!\left(\alpha (R_i^j)^2 / 8\right) & \text{if } R_i^j > 0 \\ 0 & \text{if } R_i^j \le 0 \end{cases} \qquad (1)$$

The hedging distribution is equal to the normalized weights $p_i^j = w_i^j / \sum_{k=1}^N w_k^j$ unless all of the weights are
zero, in which case we use the uniform distribution $p_i^j = 1/N$.
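As an illustration, the weighting rule can be sketched in Python. This is a minimal sketch rather than the authors' implementation: it assumes weights proportional to $\Phi'(\sqrt{\alpha} R_i^j)$ with $c = 4$, which is the relation stated in Section 2, and the function name `normalhedge_distribution` is hypothetical.

```python
import math


def normalhedge_distribution(regrets, alpha, c=4.0):
    """Return hedging probabilities from discounted regrets.

    Assumption: weights are proportional to Phi'(sqrt(alpha) * R), where
    Phi'(x) = (x / c) * exp(x^2 / (2c)) for x > 0 and 0 otherwise.
    Falls back to the uniform distribution when every weight is zero.
    """
    weights = []
    for r in regrets:
        x = math.sqrt(alpha) * r
        weights.append((x / c) * math.exp(x * x / (2 * c)) if x > 0 else 0.0)
    total = sum(weights)
    if total == 0.0:
        return [1.0 / len(regrets)] * len(regrets)
    return [w / total for w in weights]


# Actions with non-positive regret get zero weight.
print(normalhedge_distribution([5.0, -2.0, 20.0, 0.0], alpha=0.01))
# With no positive regret, the distribution is uniform.
print(normalhedge_distribution([-1.0, 0.0], alpha=0.01))
```

Because the weights are normalized, any positive constant factor in the weight formula leaves the resulting distribution unchanged.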

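Our main result is that if $\alpha$ is sufficiently small, the following inequality holds uniformly over all game
histories:

$$\frac{1}{N} \sum_{i=1}^N \Phi\!\left(\sqrt{\alpha}\, R_i^j\right) \le 2.32,$$

where

$$\Phi(x) = \begin{cases} \exp(x^2/8) & \text{if } x > 0 \\ 1 & \text{if } x \le 0. \end{cases}$$

This implies, in particular, that for any $i$ and $j$,

$$R_i^j \le \sqrt{\frac{8 \ln (2.32 N)}{\alpha}}.$$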

The discount factor $\alpha$ plays a similar role to the number of iterations in the standard undiscounted
cumulative loss framework. Indeed, it is easy to transform the usual exponential weights algorithms from
the standard framework (e.g. Hedge [FS97]) to our present setting (Section 4). Such algorithms also enjoy
discounted cumulative regret bounds of

$$C \sqrt{\frac{\ln N}{\alpha}}$$

for some positive constant $C$, but they require knowledge of the number of actions $N$ to tune a learning
parameter. The tuning of NormalHedge does not have this requirement.¹

The rest of this paper is organized as follows. In Section 2 we describe the main ideas behind the
construction and analysis of NormalHedge. In Sections 3 and 4 we discuss related work and compare
NormalHedge to exponential weights algorithms. Finally, in Section 5 we suggest how to use NormalHedge
to track latent variables and sketch how that might be used for learning HMMs under the $L_1$ loss.



2 NormalHedge 
2.1 Preliminaries 

NormalHedge and its analysis are based on the potential function $\Phi(x)$ introduced in Section 1. Here we
give a slightly more elaborate definition for $\Phi(x)$ that includes a constant $c$. The potential function is a
non-decreasing function of $x \in \mathbb{R}$:

$$\Phi(x) = \begin{cases} \exp\!\left(x^2/(2c)\right) & \text{if } x > 0 \\ 1 & \text{if } x \le 0 \end{cases} \qquad (2)$$

where $c > 1$. In our current version of NormalHedge, $c = 4$. Decreasing $c$ will improve the bound on the
regret; we will also argue that $c$ cannot be decreased to 1.

The weights assigned by NormalHedge are set proportional to the first derivative of $\Phi$, i.e. $w_i^j \propto \Phi'(\sqrt{\alpha}\, R_i^j)$,
where

$$\Phi'(x) = \begin{cases} \dfrac{x}{c}\, e^{x^2/(2c)} & \text{if } x > 0 \\ 0 & \text{if } x \le 0. \end{cases}$$

In our analysis, we will also need to examine the second derivative of $\Phi$:

$$\Phi''(x) = \begin{cases} \left(\dfrac{1}{c} + \dfrac{x^2}{c^2}\right) e^{x^2/(2c)} & \text{if } x > 0 \\ 0 & \text{if } x < 0. \end{cases}$$

Note that $\Phi''(x)$ has a discontinuity at $x = 0$.

¹ The guarantees afforded to NormalHedge require $\alpha$ to be sufficiently smaller than $1/\ln N$, but this restriction is operationally
different from needing to know $N$ in advance.
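The two derivatives can be sanity-checked against finite differences. The sketch below is illustrative only; the function names `phi`, `phi_prime` and `phi_double_prime` are assumptions introduced here, and the default $c = 4$ follows the value used in the current version of NormalHedge.

```python
import math


def phi(x, c=4.0):
    """Potential: exp(x^2 / (2c)) for x > 0, and 1 otherwise."""
    return math.exp(x * x / (2 * c)) if x > 0 else 1.0


def phi_prime(x, c=4.0):
    """First derivative of the potential."""
    return (x / c) * math.exp(x * x / (2 * c)) if x > 0 else 0.0


def phi_double_prime(x, c=4.0):
    """Second derivative for x != 0 (it jumps from 0 to 1/c at x = 0)."""
    if x <= 0:
        return 0.0
    return (1 / c + x * x / (c * c)) * math.exp(x * x / (2 * c))


# Compare against central finite differences at a few positive points.
h = 1e-5
for x in (0.5, 1.0, 3.0):
    d1 = (phi(x + h) - phi(x - h)) / (2 * h)
    d2 = (phi_prime(x + h) - phi_prime(x - h)) / (2 * h)
    print(x, abs(d1 - phi_prime(x)), abs(d2 - phi_double_prime(x)))
```

Both printed discrepancies should be tiny for positive $x$; at $x = 0$ the second derivative is not defined, which is the discontinuity noted above.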



2.2 An intuitive derivation 

The intuition behind the potential function is based on considering the following strategy for Nature. Suppose
there are two types of actions, good actions and poor actions. The gain for each action on each iteration is
chosen independently at random from a distribution over $\{-1, +1\}$. The distribution for poor actions has
equal probabilities 1/2, 1/2 on the two outcomes, while the distribution for the good actions is $(1+\gamma)/2$ on
$+1$ and $(1-\gamma)/2$ on $-1$ for some very small $\gamma > 0$. Clearly, the best hedging strategy is to put equal positive
weights on the good actions and zero weight on the poor actions. Unfortunately, the hedging algorithm
does not know at the beginning of the game which actions are good, so it has to learn these weights online.
Assuming that the number of actions is infinite (or sufficiently large), the per-iteration gain of the optimal
weighting is $\gamma$, which implies that the discounted cumulative gain of this strategy is $\gamma/\alpha$.

Consider the regrets of this optimal hedging with respect to the good actions. It is not hard to show
that the expected value of the discounted cumulative gain of a good action is $\gamma/\alpha$ and that the variance is
approximately $1/\alpha$ (becomes exact as $\gamma \to 0$). Moreover, if $\alpha \to 0$ this distribution approaches a normal
distribution with mean $\gamma/\alpha$ and variance $1/\alpha$. In other words, the distribution of the regrets of optimal
hedging with respect to the good actions is $(1/Z) \exp(-\alpha R^2/2)$.

Consider the expected value of the potential function $\Phi(\sqrt{\alpha} R)$ for this distribution over the regrets. If
we set $c = 1$ we find that the product of the probability of the regret $R$ and the potential for the regret $R$ is
a constant independent of $R$:

$$\frac{1}{Z} \exp\!\left(-\frac{\alpha R^2}{2}\right) \cdot \exp\!\left(\frac{\alpha R^2}{2}\right) = \frac{1}{Z} = O(1).$$

Thus the expected potential is infinite. However, if we set c to be larger than 1 then the expected value of 
the potential function becomes finite. Thus, roughly speaking, the potential associated with a regret value 
is the reciprocal of the probability of that regret value being a result of random fluctuations. This level of 
regret is unavoidable. The design of NormalHedge is based on the goal of not allowing the average regret to 
grow beyond this level that is generated by random fluctuations. Ideally, we would be able to use a potential 
function with any constant c larger than 1. However, what we are able to prove is that the algorithm works 
for c = 4. 

The idea of NormalHedge is to keep the average potential small. It is therefore natural that the weight
assigned to each action is proportional to the derivative of the potential. Indeed, it is easily checked that the
weights $w_i^j$ defined in Equation (1) are proportional to $\Phi'(\sqrt{\alpha}\, R_i^j)$. This derivative, however, is best viewed
when the hedging game is mapped into continuous time.



2.3 The continuous time limit 

Our analysis of NormalHedge is based on mapping the integer time steps $j = 0, 1, 2, \ldots$ into real-valued
time steps $t = 0, \alpha, 2\alpha, \ldots$ and then taking the limit $\alpha \to 0$. Formally, we redefine the hedging game using
a different notation which uses the real-valued time $t$ instead of the time index $j$. We assume a set of $N$
actions (experts), indexed by $i$. The game between the hedging algorithm and Nature proceeds in iterations
$t = 0, \alpha, 2\alpha, \ldots$. At each iteration the following sequence of actions takes place.

1. The hedging algorithm defines a distribution $\{p_i(t)\}_{i=1}^N$ over the actions, with $p_i(t) \ge 0$ and $\sum_{i=1}^N p_i(t) = 1$.

2. Nature associates a gain $g_i(t) \in [-\sqrt{\alpha}, +\sqrt{\alpha}]$ with action $i$.

3. The gain of the hedger is $g_A(t) = \sum_{i=1}^N p_i(t)\, g_i(t)$.

We skip the definitions of $G_i(t)$ and $G_A(t)$ as these can become ill-behaved when $\alpha \to 0$. Instead we
define the regret directly:

$$R_i(0) = 0, \qquad R_i(t+\alpha) = (1-\alpha)\, R_i(t) + g_i(t) - g_A(t).$$

Note that this definition of the regret is a scaled version of the discrete time regret:

$$R_i(j\alpha) = \sqrt{\alpha}\, R_i^j.$$
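The scaling relation between the two regrets follows because the recursion is linear in the gains. The short Python sketch below is illustrative only: the function names and the random gain sequences are assumptions. It runs the discrete-time recursion with gains in $[-1, 1]$ and the rescaled recursion with gains multiplied by $\sqrt{\alpha}$, then compares the results.

```python
import math
import random


def discrete_regret(gains, hedger_gains, alpha):
    """R^{j+1} = (1 - alpha) R^j + g^j - g_A^j with gains in [-1, 1]."""
    r = 0.0
    for g, g_a in zip(gains, hedger_gains):
        r = (1 - alpha) * r + g - g_a
    return r


def scaled_regret(gains, hedger_gains, alpha):
    """Same recursion, but with gains rescaled by sqrt(alpha)."""
    s = math.sqrt(alpha)
    r = 0.0
    for g, g_a in zip(gains, hedger_gains):
        r = (1 - alpha) * r + s * g - s * g_a
    return r


rng = random.Random(1)
alpha = 0.01
g = [rng.choice([-1.0, 1.0]) for _ in range(200)]
g_a = [rng.uniform(-1.0, 1.0) for _ in range(200)]
# The two quantities differ by the factor sqrt(alpha), up to rounding.
print(scaled_regret(g, g_a, alpha), math.sqrt(alpha) * discrete_regret(g, g_a, alpha))
```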
We now have the tools needed to prove our main result. 

Theorem 1 There exists a positive constant $C < 2.32$ such that if $\alpha < 1/(800 \ln CN)$, then for any sequence
of gains and any iteration $j$,

$$\frac{1}{N} \sum_{i=1}^N \Phi\!\left(\sqrt{\alpha}\, R_i^j\right) \le C.$$

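Proof sketch. The full proof is given in the appendix, but here we sketch a continuous-time argument
(i.e. we consider $\alpha \to 0$). The formal, discrete-time proof shows that it is enough for $\alpha < 1/(800 \ln CN)$.
We want to show that the average potential

$$\Psi(t) = \frac{1}{N} \sum_{i=1}^N \Phi(R_i(t))$$

is bounded for all time $t$. Our approach is to show that its time-derivative

$$\frac{d}{dt}\Psi(t) = \lim_{\alpha \to 0} \frac{1}{\alpha} \cdot \frac{1}{N} \sum_{i=1}^N \big\{ \Phi(R_i(t+\alpha)) - \Phi(R_i(t)) \big\}$$

becomes non-positive as soon as $\Psi(t)$ is above some constant (recall that the time steps are in increments
of $\alpha$). Since $\Phi(x)$ is constant for $x \le 0$, we need only consider $i$ such that $R_i(t+\alpha) > 0$. Ignoring the
discontinuity of $\Phi''(x)$ at $x = 0$, Taylor's theorem implies that for some $\rho_i \le \max\{R_i(t), R_i(t+\alpha)\}$,

$$\begin{aligned}
\sum_{i: R_i(t) > 0} \big[\Phi(R_i(t+\alpha)) - \Phi(R_i(t))\big]
&= \sum_{i: R_i(t) > 0} \big[\Phi\big((1-\alpha) R_i(t) + g_i(t) - g_A(t)\big) - \Phi(R_i(t))\big] \\
&= \sum_{i: R_i(t) > 0} \Big[\big(-\alpha R_i(t) + g_i(t) - g_A(t)\big)\, \Phi'(R_i(t)) + \tfrac{1}{2}\big(g_i(t) - g_A(t) - \alpha R_i(t)\big)^2\, \Phi''(\rho_i)\Big] \\
&\le \sum_{i: R_i(t) > 0} \Big[-\alpha R_i(t)\, \Phi'(R_i(t)) + \tfrac{1}{2}\big(g_i(t) - g_A(t) - \alpha R_i(t)\big)^2\, \Phi''(\rho_i)\Big] \\
&\le \sum_{i: R_i(t) > 0} \Big[-\alpha R_i(t)\, \Phi'(R_i(t)) + \tfrac{1}{2}\big(2\sqrt{\alpha} + \alpha R_i(t)\big)^2\, \Phi''\big(R_i(t) + 2\sqrt{\alpha}\big)\Big].
\end{aligned}$$

The first inequality uses the fact that the weights are proportional to the derivatives of the potentials,

$$\sum_{i=1}^N g_i(t)\, \frac{\Phi'(R_i(t))}{\sum_{k=1}^N \Phi'(R_k(t))} = g_A(t),$$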
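and the second inequality follows because $|g_i(t) - g_A(t)| \le 2\sqrt{\alpha}$. Now dividing by $\alpha$ and $N$ and taking the
limit $\alpha \to 0$, we have

$$\begin{aligned}
\frac{d}{dt}\Psi(t)
&\le \lim_{\alpha \to 0} \frac{1}{\alpha N} \sum_{i: R_i(t) > 0} \Big[\tfrac{1}{2}\big(2\sqrt{\alpha} + \alpha R_i(t)\big)^2\, \Phi''\big(R_i(t) + 2\sqrt{\alpha}\big) - \alpha R_i(t)\, \Phi'(R_i(t))\Big] \\
&= \lim_{\alpha \to 0} \frac{1}{N} \sum_{i: R_i(t) > 0} \Big[\tfrac{1}{2}\big(2 + \sqrt{\alpha} R_i(t)\big)^2\, \Phi''\big(R_i(t) + 2\sqrt{\alpha}\big) - R_i(t)\, \Phi'(R_i(t))\Big] \\
&= \frac{1}{N} \sum_{i: R_i(t) > 0} \Big[2\Big(\frac{1}{c} + \frac{R_i(t)^2}{c^2}\Big) \exp\!\Big(\frac{R_i(t)^2}{2c}\Big) - \frac{R_i(t)^2}{c} \exp\!\Big(\frac{R_i(t)^2}{2c}\Big)\Big] \\
&\le \frac{2}{c}\,\Psi(t) + \frac{1}{N} \sum_{i: R_i(t) > 0} \Big(\frac{2}{c^2} - \frac{1}{c}\Big) R_i(t)^2 \exp\!\Big(\frac{R_i(t)^2}{2c}\Big).
\end{aligned}$$

If $\Psi(t) \ge B$, then this final RHS is maximized when $R_i(t) = \sqrt{2c \ln B}$ for all $i$, whereupon

$$\frac{d}{dt}\Psi(t) \le \frac{2B}{c}\big(1 + (2 - c) \ln B\big).$$

This is non-positive for sufficiently large $B$ and $c > 2 + 1/\ln B$. ∎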

3 Related work 

3.1 Relation to other online learning algorithms 

The Hedge algorithm [FS97], as well as most of the work on online learning algorithms, is based on exponential
weighting, where the weight assigned to an expert is exponential in the cumulative loss of that expert.
NormalHedge uses a very different weighting scheme. The most important difference is that the weight of
an expert depends on the regret of the master algorithm relative to that expert, rather than just on the loss
of that expert. In particular, experts whose discounted cumulative loss is larger than that of the master
algorithm receive zero weight. We expand on the comparison of NormalHedge to Hedge in Section 4.

The starting point for the derivation and analysis of NormalHedge is the Binomial Weights algorithm of
Cesa-Bianchi et al. [CBFHW96]. The Binomial Weights algorithm is an algorithm for a restricted version of
the experts prediction problem [LW94, CBFH+97]. In this version the sequence to be predicted is binary and all
of the predictions are also binary. The Binomial Weights algorithm is analyzed using a type of chip game.
In this game each expert is represented as a chip, and at each iteration each chip has a location on the integer
line. The position of the chip corresponds to the number of mistakes that were made by the expert. The
a priori assumption is that there is at least one expert which makes at most $k$ mistakes, and the goal is to
define a rule for combining the experts' predictions in a way that would minimize the maximal number of
mistakes of the master algorithm.

The chip game analysis leads naturally to the definition of the potential function, and the evolution of
this potential function from iteration to iteration yields the Binomial Weights algorithm. A closely related
notion of potential was used in the Boost-by-Majority algorithm. The chip-game analysis was extended by
Schapire's work on drifting games [Sch01] and by Freund and Opper's work on drifting games in continuous
time [FO02]. NormalHedge naturally extends the continuous time drifting games to a setting in which one
seeks to minimize discounted loss.



3.2 Relation to switching and sleeping experts 

The use of discounted cumulative loss represents an alternative to the "switching experts" framework of
Warmuth and Herbster [HW98]. If the best expert changes at a rate of $O(\alpha)$, then NormalHedge will
switch to the new best expert because the losses that occurred more than $1/\alpha$ iterations ago make a small
contribution to the discounted total loss.
A useful extension of NormalHedge is to use experts that can abstain, similar to the setup studied in
[FSSW97]. To do this we assume that each expert $i$, at each iteration $j$, outputs a confidence level $0 \le c_i^j \le 1$.
Instead of using the vector $\{p_i^j\}_{i=1}^N$, the hedger uses the vector $\{p_i^j c_i^j / Z^j\}_{i=1}^N$, where $Z^j = \sum_{i=1}^N p_i^j c_i^j$. The
gain $g_i^j$ of action $i$ at iteration $j$ is replaced by $c_i^j g_i^j$, and the discounted cumulative gain and the discounted
cumulative regret change in the corresponding way. The bounds on the average potential transfer without
change. This allows an expert to abstain from making a prediction. By setting $c_i^j = 0$ the expert effectively
removes itself from the pool of experts used by the hedger. It also avoids suffering any loss. However, an
expert cannot always abstain, because then its discounted cumulative gain will be driven to zero by the
discount factor. We will use this extension in Section 5.
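The reweighting by confidence levels can be sketched in a few lines of Python. This is an illustrative sketch only: the function name `confidence_rated_distribution` is hypothetical, and the handling of the case where every weighted expert abstains is an assumption, since the text does not specify it.

```python
def confidence_rated_distribution(p, c):
    """Reweight a hedging distribution p by confidence levels c in [0, 1].

    Returns p_i * c_i / Z with Z = sum_i p_i * c_i. The case Z == 0 (every
    weighted expert abstains) is not covered by the text; returning None here
    is an illustrative choice.
    """
    z = sum(pi * ci for pi, ci in zip(p, c))
    if z == 0.0:
        return None
    return [pi * ci / z for pi, ci in zip(p, c)]


# The second expert abstains (confidence 0) and drops out of the pool.
print(confidence_rated_distribution([0.5, 0.3, 0.2], [1.0, 0.0, 0.5]))
```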

4 Comparison of NormalHedge and Hedge 
4.1 Discounted regret bound for Hedge 

To ease the comparison, we first recast the Hedge algorithm [FS97] into our current framework with
discounted gains. The weights used by Hedge are

$$w_i^j = \exp(\eta\, G_i^j)$$

where $G_i^j$ is the discounted cumulative gain of action $i$ at the start of iteration $j$, and $\eta > 0$ is the learning
rate parameter. When written recursively as

$$w_i^{j+1} = \exp\!\big(\eta((1-\alpha)\, G_i^j + g_i^j)\big) \propto (w_i^j)^{1-\alpha} \exp(\eta\, g_i^j),$$

we see that the effect of discounting is a dampening of the previous weights $w_i^j$ prior to the usual multiplicative
update rule.
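The equivalence of the closed form and the recursive form of the discounted Hedge weights can be checked directly. The following Python sketch is illustrative: the function names and the toy gain history are assumptions, not taken from the paper.

```python
import math


def hedge_weights_closed_form(gain_history, alpha, eta):
    """w_i = exp(eta * G_i), with G_i the discounted cumulative gain."""
    n = len(gain_history[0])
    G = [0.0] * n
    for g in gain_history:
        G = [(1 - alpha) * Gi + gi for Gi, gi in zip(G, g)]
    return [math.exp(eta * Gi) for Gi in G]


def hedge_weights_recursive(gain_history, alpha, eta):
    """Dampen the previous weights, then apply the multiplicative update."""
    n = len(gain_history[0])
    w = [1.0] * n                       # exp(eta * 0)
    for g in gain_history:
        w = [wi ** (1 - alpha) * math.exp(eta * gi) for wi, gi in zip(w, g)]
    return w


history = [[1.0, -1.0, 0.5], [0.0, 1.0, -0.5], [1.0, 1.0, -1.0]]
print(hedge_weights_closed_form(history, alpha=0.1, eta=0.3))
print(hedge_weights_recursive(history, alpha=0.1, eta=0.3))
```

Raising the previous weight to the power $1 - \alpha$ is the dampening step; with $\alpha = 0$ the recursion reduces to the usual multiplicative update.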

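Fix any iteration $j$ and define the adjusted cumulative gain of action $i$ at the start of iteration $k$ to be

$$\tilde G_i^k = \sum_{s=0}^{k-1} (1-\alpha)^{j-1-s}\, g_i^s$$

with $\tilde G_i^0 = 0$. The gain of Hedge in iteration $k$ is

$$g_A^k = \frac{\sum_{i=1}^N w_i^k\, g_i^k}{\sum_{i=1}^N w_i^k}$$

and the adjusted cumulative gain of Hedge at the start of iteration $k$ is

$$\tilde G_A^k = \sum_{s=0}^{k-1} (1-\alpha)^{j-1-s}\, g_A^s.$$

Then the discounted cumulative regret to action $i$ at the start of iteration $j$ is $G_i^j - G_A^j = \tilde G_i^j - \tilde G_A^j$.
We analyze the (log of the) ratios $W_k/W_{k-1}$, where

$$W_k = \sum_{i=1}^N e^{\eta \tilde G_i^k}$$

and $W_0 = N$. We lower bound $\ln(W_j/W_0)$ as

$$\ln \frac{W_j}{W_0} = \ln \sum_{i=1}^N e^{\eta \tilde G_i^j} - \ln N \ge \ln e^{\eta \tilde G_i^j} - \ln N = \eta\, G_i^j - \ln N$$

(for any $i$), and we upper bound it as

$$\begin{aligned}
\ln \frac{W_j}{W_0}
&= \sum_{k=0}^{j-1} \ln \frac{W_{k+1}}{W_k}
= \sum_{k=0}^{j-1} \ln \frac{\sum_{i=1}^N e^{\eta \tilde G_i^k}\, e^{\eta (1-\alpha)^{j-1-k} g_i^k}}{\sum_{i=1}^N e^{\eta \tilde G_i^k}} \\
&\le \sum_{k=0}^{j-1} \Big[\eta\, (1-\alpha)^{j-1-k}\, g_A^k + \frac{\eta^2}{8} \cdot 4 (1-\alpha)^{2(j-1-k)}\Big] \qquad \text{(Hoeffding's inequality)} \\
&= \eta\, G_A^j + \frac{\eta^2}{2} \sum_{k=0}^{j-1} (1-\alpha)^{2(j-1-k)} \\
&\le \eta\, G_A^j + \frac{\eta^2}{2} \cdot \frac{1}{1 - (1-\alpha)^2} \\
&= \eta\, G_A^j + \frac{\eta^2}{4(\alpha - \alpha^2/2)}.
\end{aligned}$$

Therefore, the discounted cumulative regret of Hedge to action $i$ at the start of any iteration $j$ is

$$R_i^j = G_i^j - G_A^j \le \frac{\ln N}{\eta} + \frac{\eta}{4(\alpha - \alpha^2/2)}.$$

Choosing $\eta = \sqrt{4(\alpha - \alpha^2/2) \ln N}$ gives

$$R_i^j \le \sqrt{\frac{\ln N}{\alpha - \alpha^2/2}}.$$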



The regret bound is of the same form as that implied by Theorem 1, indeed, with better leading constants.
However, this bound only holds when $\eta$ is tuned with knowledge of the number of actions $N$. If instead one
sets $\eta = \Theta(\sqrt{\alpha})$ independently of $N$, the bound for Hedge is worse by a factor of $O(\sqrt{\ln N})$. Furthermore,
this setting of $\eta$ is for optimizing a bound that anticipates the worst-case sequence of gains; when Nature is
not optimally adversarial, then a proper setting of $\eta$ may require other prior knowledge.
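To make the dependence on tuning concrete, the Hedge bound $\ln N/\eta + \eta/(4(\alpha - \alpha^2/2))$ can be evaluated for a tuned and an untuned learning rate. The Python sketch below is illustrative only: the function names are assumptions, and $\eta = \sqrt{\alpha}$ is just one example of a setting that is $\Theta(\sqrt{\alpha})$ and ignores $N$.

```python
import math


def hedge_regret_bound(eta, n, alpha):
    """ln(N)/eta + eta / (4 (alpha - alpha^2 / 2))."""
    return math.log(n) / eta + eta / (4 * (alpha - alpha ** 2 / 2))


def tuned_eta(n, alpha):
    """Learning rate sqrt(4 (alpha - alpha^2/2) ln N), which needs N."""
    return math.sqrt(4 * (alpha - alpha ** 2 / 2) * math.log(n))


alpha = 0.001
for n in (10, 1000, 10 ** 6):
    tuned = hedge_regret_bound(tuned_eta(n, alpha), n, alpha)
    untuned = hedge_regret_bound(math.sqrt(alpha), n, alpha)  # eta ignores N
    print(n, round(tuned, 2), round(untuned, 2), round(untuned / tuned, 2))
```

The last printed column is the ratio of the untuned bound to the tuned one; it grows with $N$, in line with the $O(\sqrt{\ln N})$ factor mentioned above.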



4.2 Simulations 

4.2.1 The effect of good experts 

To empirically compare Hedge and NormalHedge, we first simulated the two algorithms in a scenario similar
to that described in Section 2.

• The number of experts is $N = 1000$, and the discount parameter is $\alpha = 0.001$.

• At any given time, there is a set of $N_G = f \cdot N$ good experts and $N - N_G$ bad experts. (We varied
$f \in \{0.001, 0.01, 0.1, 0.5\}$.)

- With probability $0.5 + \gamma/2$, every good expert receives gain $+1$; with probability $0.5 - \gamma/2$, every
good expert receives gain $-1$. (We varied $\gamma \in \{0.2, 0.4, 0.6, 0.8\}$.)

- Bad experts receive gain $+1$ and $-1$ with equal probability.

• Initially, the set of good experts is $\{0, 1, \ldots, N_G - 1\}$.

• After every $1/\alpha$ iterations, the set of good experts shifts from $\{i_0, i_0 + 1, \ldots, i_0 + N_G - 1\}$ to
$\{i_0 + N_G, i_0 + N_G + 1, \ldots, i_0 + 2N_G - 1\}$ (with addition modulo $N$).

Thus, the set of good experts completely changes every $1/\alpha$ iterations. In each iteration, all good experts
receive the same gain, which is $\gamma$ in expectation. In contrast, the gain of each bad expert is decided
independently with a fair coin.
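The gain-generating process described above can be sketched as a small Python generator. This is not the authors' simulation code: the function name `simulate_gains`, the rounding of $f \cdot N$ and $1/\alpha$ to integers, and the seeding are all assumptions made for illustration.

```python
import random


def simulate_gains(n, alpha, f, gamma, steps, seed=0):
    """Yield (gains, good_set) per iteration for the shifting-experts setup.

    n: number of experts; f: fraction of good experts; gamma: advantage of
    good experts. All good experts share one biased coin per iteration; each
    bad expert flips its own fair coin. The good block shifts by n_good
    (modulo n) after every 1/alpha iterations.
    """
    rng = random.Random(seed)
    n_good = max(1, round(f * n))       # assumption: round to an integer
    period = round(1 / alpha)           # assumption: round to an integer
    start = 0
    for step in range(steps):
        if step > 0 and step % period == 0:
            start = (start + n_good) % n
        good = {(start + k) % n for k in range(n_good)}
        shared = 1.0 if rng.random() < 0.5 + gamma / 2 else -1.0
        gains = [shared if i in good else rng.choice([-1.0, 1.0])
                 for i in range(n)]
        yield gains, good


# Small illustrative run (the paper uses N = 1000 and alpha = 0.001).
for gains, good in simulate_gains(n=10, alpha=0.5, f=0.2, gamma=0.8, steps=4):
    print(sorted(good), gains)
```

Feeding these gains to a hedging rule and tracking the discounted regret to the best expert would reproduce the kind of comparison reported in this section, though the exact experimental code is not given in the paper.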

We tuned the learning rate parameter for Hedge to $\eta = \sqrt{(\alpha - \alpha^2/2) \ln N}$. For NormalHedge, we varied
$c \in \{1, 2, 4\}$. Recall that the regret bound we can show for NormalHedge holds for $c = 2$ as $\alpha \to 0$ (the
formal proof is stated with $c = 4$).

Figures 1 and 2 depict the discounted cumulative regret to the best expert (averaged over 50 runs).
First, we observe that NormalHedge fares better than Hedge when the advantage of the good experts is
large and the fraction of experts that are good is large. In such cases, the advantage of NormalHedge is
especially pronounced within $1/\alpha$ iterations (before the set of good experts shifts). Second, we observe that
the performance of NormalHedge generally improves as the value of $c$ is decreased. Indeed, the setting of
$c = 1$ (for which we have no theoretical guarantees) yields the best results for NormalHedge (and in fact
outperforms Hedge in every simulation). It would be very interesting to establish guarantees for NormalHedge
for $c \to 1$.

4.2.2 The effect of tuning $\eta$ in Hedge

Next, to bring out the issue with parameter tuning in Hedge, we conducted a simulation in which we fix the 
fraction of experts that are good, but vary the total number of experts: 

• The number of experts is $N$, and the discount parameter is $\alpha = 0.001$. (We varied $N \in \{10, 100, 1000\}$.)

• The fraction of experts that are good is fixed at $f = 0.1$. The notion of good and bad experts is the
same as in the first simulation. (We varied $\gamma \in \{0.2, 0.8\}$.)

• The remaining details are the same as in the first simulation.

Again, we tuned the learning rate parameter for Hedge to $\eta = \sqrt{(\alpha - \alpha^2/2) \ln N}$, which now changes as we
vary the total number of experts, and we varied $c \in \{1, 2, 4\}$ in NormalHedge.

The results (Figure 3) indicate that as $N$ decreases (e.g. $N = 100, 10$), the disparity between Hedge and
NormalHedge increases. We believe this is an issue with tuning the learning rate $\eta$, which is conspicuously
absent in NormalHedge, but we have not precisely characterized the issue.

5 Inferring latent random variables 

An important problem in statistical inference is to make predictions or choose actions when the system 
under consideration has internal states that cannot be observed directly. There are many manifestations of 
this problem, including Graphical models, Hidden Markov Models (HMMs), Partially Observable Markov 
Decision Processes (POMDPs) and Kalman filters. The common method for dealing with hidden states is to 
model them as latent random variables. The relation between the latent random variables and the observable 
random variables is modeled using a joint probability distribution. Two very important sub-problems that
arise in this approach are (i) learning joint distributions that involve latent random variables from examples that
contain only the state of the observable random variables, and (ii) using this type of joint distribution to infer
the value of some variables given the state of others. At this time there is no good universal solution to
either of these sub-problems.

We propose a different approach to the problem, where instead of associating hidden states with hidden 
random variables, we associate states with different experts. What we present here describes some initial 
ideas. It is not an attempt to propose a solution to this large and complex problem. 

Suppose that we are to predict a binary sequence $x_1, x_2, \ldots, x_t \in \{0, 1\}$ and suppose that we believe that
the sequence can be predicted reasonably well using a Hidden Markov Model. Specifically, suppose there is
a hidden state $S$ which attains one of the values $1, \ldots, k$ at each time step. Suppose that the state transition
is Markovian and stationary, i.e.

$$P(S_t \mid S_{t-1}, S_{t-2}, \ldots) = P(S_t \mid S_{t-1}) = P(S_{t-1} \mid S_{t-2}) = \cdots$$

Assume in addition that the hidden state does not change very often, i.e. $P(S_{t+1} = S_t)$ is close to 1. Finally,
assume that the distribution of the observable variable $X_t$ depends only on the hidden state at the same
time, $S_t$.

Consider the problem of predicting $X_{t+1}$ given $x_1, \ldots, x_t$ and the parameters of the HMM. Suppose that
the prediction needs to take the form of a distribution over S. So far this is exactly the standard framework,
but suppose we differ from the standard framework by considering the $L_1$ loss $1 - p_t(x_t)$, where $p_t(x_t)$ is the
predicted probability assigned to the letter that actually occurred at time $t$. This is instead of the standard
log likelihood loss $\log(1/p_t(x_t))$. While the log loss is easier to analyze, the $L_1$ loss is often a more useful
measure because the cumulative $L_1$ loss corresponds to the expected number of mistakes. While this loss
does not fit well in the maximal likelihood or Bayesian methodologies, it fits NormalHedge very well, because
the loss per iteration is bounded.

Here is our proposal for solving the prediction problem using NormalHedge. We associate a set of experts
with each hidden state. The experts are confidence rated, i.e. each one of the experts outputs a confidence
level $0 \le c \le 1$ at each time step; the confidence level is used in the confidence-rated variant of NormalHedge
described in Section 3.2. If expert $i$ corresponds to a hidden state $j$ then $c_i$ should be large when
$S = j$ and low when $S \ne j$. Suppose that the parameters of the HMM are known; then we can associate
a single expert with each hidden state and compute the prediction and the confidence value of that expert
using Bayes' formula.

Now suppose that we don't know the parameter vector of the HMM but that we know that the vector
is one of $N$ possibilities. In this case we associate $N$ experts with each hidden state and compute the
predictions and confidence value of each expert using Bayes' formula for the corresponding parameter vector;
the confidence value for each state is the a posteriori probability for that state.

In this case the NormalHedge algorithm will quickly converge and give most of the weight to the experts 
that correspond to the correct parameter vector. Moreover, if none of the parameter vectors is a correct 
description of the sequence distribution, it will converge on the vector which causes the least regret, i.e. 
makes the smallest number of mistakes. 
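One way to obtain the a posteriori state probabilities used as confidence values is standard forward (Bayes) filtering. The Python sketch below is an assumption-laden illustration, not the authors' procedure: the function name `filter_posterior`, the two-state model, and all numeric parameters are hypothetical, and it only shows how a posterior over hidden states could be computed for one candidate parameter vector.

```python
def filter_posterior(prior, transition, emission, observations):
    """Forward (Bayes) filtering for a discrete HMM.

    prior[s]: P(S_1 = s); transition[s][s2]: P(S_{t+1} = s2 | S_t = s);
    emission[s][x]: P(X_t = x | S_t = s).
    Returns P(S_t = s | x_1, ..., x_t) after the last observation.
    """
    k = len(prior)
    belief = list(prior)
    for t, x in enumerate(observations):
        if t > 0:  # propagate the belief through the state transition
            belief = [sum(belief[s] * transition[s][s2] for s in range(k))
                      for s2 in range(k)]
        # Bayes' formula: reweight by the likelihood of the observation
        belief = [belief[s] * emission[s][x] for s in range(k)]
        total = sum(belief)
        belief = [b / total for b in belief]
    return belief


# Hypothetical 2-state HMM with sticky states and a binary alphabet.
prior = [0.5, 0.5]
transition = [[0.95, 0.05], [0.05, 0.95]]
emission = [[0.9, 0.1], [0.2, 0.8]]   # state 0 mostly emits 0, state 1 mostly 1
confidence = filter_posterior(prior, transition, emission, [1, 1, 1, 0, 1])
print(confidence)  # one confidence value per hidden-state expert
```

With $N$ candidate parameter vectors, the same computation would be run once per vector, giving one confidence value for each (vector, state) expert.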

Contrast this with the Bayesian approach. If the true distribution generating the data is not included in
the set of models over which we take the posterior average, and if the loss function in which we are interested
is not log-likelihood but rather the number of mistakes, then the cumulative loss of the Bayesian average can
be much larger than that of the best model in the set.

6 Open problems 

The most interesting open problem is to close the gap between the upper bound and lower bound on the
parameter $c$. We have a lower bound of $c > 1$ and an upper bound of $c = 4$. If we consider the case $\alpha \to 0$
we can reduce $c$ to 2. However, the gap between $c = 1$ and $c = 2$ remains.

One promising direction of expansion is to consider the game in the continuous time limit directly. This 
leads us naturally into stochastic processes in continuous time such as Wiener processes. Understanding 
the performance of NormalHedge in this context might yield new methods for stochastic estimation and 
stochastic control. 
A Proof of main theorem

Recall, the cumulative discounted regret of action $i$ at time $t = j\alpha$, $j \in \mathbb{N}$, is defined recursively by

$$R_i(0) = 0, \qquad R_i(t+\alpha) = (1-\alpha)\, R_i(t) + g_i(t) - g_A(t),$$

where $g_i(t) \in [-\sqrt{\alpha}, +\sqrt{\alpha}]$ is the (scaled) gain of action $i$ at time $t$, and $g_A(t) \in [-\sqrt{\alpha}, +\sqrt{\alpha}]$ is the (scaled)
gain of the hedger at time $t$. We define $r_i(t) = (g_i(t) - g_A(t))/\sqrt{\alpha} \in [-2, +2]$ as the (unscaled) instantaneous
regret to action $i$ at time $t$. The central quantity of interest is the average potential

$$\Psi(t) = \frac{1}{N} \sum_{i=1}^N \Phi(R_i(t)).$$

Recall, we use the definition of the potential function $\Phi$ in Equation (2) with $c = 4$.

Claim 1 There exists a positive constant $C < 2.32$ such that if $\alpha < 1/(800 \ln CN)$, then the average potential
is always bounded from above by $C$; that is, $\Psi(j\alpha) \le C$ for any $j \in \mathbb{N}$.

Proof. Fix $j \in \mathbb{N}$ and let $t = j\alpha$.

We will analyze the change in average potential $\Psi(t+\alpha) - \Psi(t)$ by considering the averages over two separate groups:

$$I_1 = \{i : R_i(t) \le 0\} \quad \text{and} \quad I_2 = \{i : R_i(t) > 0\}.$$

Let $\Psi_k(t) = \frac{1}{|I_k|} \sum_{i \in I_k} \Phi(R_i(t))$ be the average potential over $I_k$ for $k = 1, 2$ (assume without loss of
generality that neither $I_k$ is empty); $\Psi_k(t+\alpha)$ denotes the average of $\Phi(R_i(t+\alpha))$ over the same set $I_k$.
We'll show the following facts:

(A): $\Psi_1(t) = 1$ and $\Psi_1(t+\alpha) < 1 + (3/5)\alpha$;

(B): If $\Psi(t) < 2.32$, then $\Psi_2(t+\alpha) - \Psi_2(t) < (2/3)\alpha$;

(C): If $2.31 < \Psi(t) < 2.32$, then $\Psi(t+\alpha) < \Psi(t)$.

These facts imply that the increase in average potential from $\Psi(t)$ to $\Psi(t+\alpha)$ is always less than $(2/3)\alpha <
1/1200$, and that if the average potential $\Psi(t)$ is strictly between 2.31 and 2.32, then $\Psi(t+\alpha)$ is strictly less
than $\Psi(t)$. The claim then follows by induction because $\Psi(0) = 1$.
We now prove the facts (A), (B), and (C).

(A): For $i \in I_1$, $\Phi(R_i(t)) = 1$ and $R_i(t+\alpha) \le (1-\alpha) R_i(t) + |r_i(t)|\sqrt{\alpha} \le 2\sqrt{\alpha}$. Since $\Phi(x)$ is
non-decreasing in $x$, we have $\Phi(R_i(t+\alpha)) \le \Phi(2\sqrt{\alpha}) = e^{\alpha/2} < 1 + \alpha/2 + \alpha^2 e^{\alpha/2}/2 < 1 + (3/5)\alpha$ (the last
inequality follows from the upper bound on $\alpha$).
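(B): We address terms in $I_2$ by expanding $\Phi(R_i(t+\alpha))$ around the point $R_i(t) > 0$ via Taylor's theorem:

$$\Phi(R_i(t+\alpha)) = \Phi(R_i(t)) + d_i(t)\, \Phi'(R_i(t)) + \frac{1}{2} d_i(t)^2\, \Phi''(\rho_i)$$

where $d_i(t) = r_i(t)\sqrt{\alpha} - \alpha R_i(t)$ and $\rho_i \in \mathbb{R}$ lies between $R_i(t)$ and $R_i(t+\alpha)$. Because the hedger's weights
are chosen so that $p_i(t) \propto \Phi'(R_i(t))$, we have that

$$\sum_{i=1}^N g_i(t)\, \Phi'(R_i(t)) = g_A(t) \sum_{i=1}^N \Phi'(R_i(t)), \quad \text{i.e.} \quad \sum_{i=1}^N r_i(t)\, \Phi'(R_i(t)) = 0,$$

and thus, since $\Phi'(R_i(t)) = 0$ for $i \in I_1$,

$$\sum_{i \in I_2} \big[\Phi(R_i(t+\alpha)) - \Phi(R_i(t))\big] = \sum_{i \in I_2} \Big[-\alpha R_i(t)\, \Phi'(R_i(t)) + \frac{1}{2} d_i(t)^2\, \Phi''(\rho_i)\Big].$$

We need a few bounds before proceeding. First, if $\Psi(t) < 2.32$, then $\Phi(R_i(t)) < 2.32N$ for all $i$, which
implies $R_i(t) < \sqrt{8 \ln(2.32N)}$ for all $i$. By the condition on $\alpha$, we also have $\sqrt{\alpha} R_i(t) < 1/10$. Next, we use
a bound on $\rho_i$ since it is evaluated in the non-decreasing function $\Phi''(x)$:

$$\rho_i^2 \le \max\{R_i(t), R_i(t+\alpha)\}^2 \le \big(R_i(t) + |r_i(t)|\sqrt{\alpha}\big)^2 = R_i(t)^2 + 2\sqrt{\alpha}\, R_i(t)\, |r_i(t)| + \alpha\, r_i(t)^2 \le R_i(t)^2 + \frac{1}{2}.$$

Finally, we bound $d_i(t)^2$ as follows:

$$d_i(t)^2 \le \big(|r_i(t)|\sqrt{\alpha} + \alpha R_i(t)\big)^2 \le (2 + 1/10)^2 \alpha.$$

Altogether, each summand satisfies

$$\begin{aligned}
-\alpha R_i(t)\, \Phi'(R_i(t)) + \frac{1}{2} d_i(t)^2\, \Phi''(\rho_i)
&= -\alpha \frac{R_i(t)^2}{4}\, e^{R_i(t)^2/8} + \frac{1}{2} d_i(t)^2 \Big(\frac{1}{4} + \frac{\rho_i^2}{16}\Big) e^{\rho_i^2/8} \\
&\le -\alpha \frac{R_i(t)^2}{4}\, e^{R_i(t)^2/8} + \frac{(2.1)^2}{2}\, \alpha \Big(\frac{9}{32} + \frac{R_i(t)^2}{16}\Big) e^{R_i(t)^2/8}\, e^{1/16} \\
&< \alpha\, e^{R_i(t)^2/8} \Big(\frac{2}{3} - \frac{R_i(t)^2}{10}\Big).
\end{aligned}$$

The final bound is decreasing as a function of $R_i(t) > 0$. This implies that each summand is less than $(2/3)\alpha$, so
$\Psi_2(t+\alpha) - \Psi_2(t) < (2/3)\alpha$.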

(C): First, consider the problem of maximizing

$$f(x_1, \ldots, x_n) = \sum_{i=1}^n \Big(\frac{2}{3} - \frac{x_i^2}{10}\Big) e^{x_i^2/8}$$

subject to the constraint $(1/n) \sum_{i=1}^n e^{x_i^2/8} \ge B$ for some $B > 1$. Simple variational arguments imply that
the maximum is attained when $x_i = \sqrt{8 \ln B}$ for all $i$. Therefore, following the argument for (B), we have
that if $\Psi_2(t) \ge B$ for some $B > 1$, then

$$\Psi_2(t+\alpha) - \Psi_2(t) < \alpha \cdot B \cdot \Big(\frac{2}{3} - \frac{4}{5} \ln B\Big).$$

Let $p_1 = |I_1|/N$ and $p_2 = 1 - p_1$. Suppose $\Psi(t) > 2.31$. Because $\Psi_1(t) = 1$, we have

$$\Psi_2(t) = \frac{1}{p_2}\big(\Psi(t) - p_1\big) > \frac{1}{p_2}(2.31 - p_1) = B.$$
P2 P2 
Now we analyze the overall change in average potential. By (A), the increase in average potential over
$i \in I_1$ is less than $(3/5)\alpha$. Then

$$\begin{aligned}
\frac{\Psi(t+\alpha) - \Psi(t)}{\alpha}
&< p_1 \cdot \frac{3}{5} + p_2 \cdot B \cdot \Big(\frac{2}{3} - \frac{4}{5} \ln B\Big) \\
&= p_1 \cdot \frac{3}{5} + (2.31 - p_1) \cdot \Big(\frac{2}{3} - \frac{4}{5} \ln \frac{1}{1 - p_1} - \frac{4}{5} \ln(2.31 - p_1)\Big).
\end{aligned}$$

The final RHS is decreasing as a function of $p_1 \ge 0$, so it is maximized when $p_1 = 0$. Making this
substitution, the RHS is negative, and thus $\Psi(t+\alpha) < \Psi(t)$. ∎
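The last two claims, that the bound $p_1 \cdot \frac{3}{5} + (2.31 - p_1)\big(\frac{2}{3} - \frac{4}{5} \ln \frac{2.31 - p_1}{1 - p_1}\big)$ is decreasing in $p_1$ and negative at $p_1 = 0$, can be checked numerically. The Python sketch below is only an illustrative check on a grid, with the function name `change_bound` chosen here; it is not a substitute for the argument.

```python
import math


def change_bound(p1):
    """Upper bound on (Psi(t + alpha) - Psi(t)) / alpha when Psi(t) > 2.31.

    p1 is the fraction of actions with non-positive regret, and
    B = (2.31 - p1) / (1 - p1) is the resulting lower bound on Psi_2(t).
    """
    b = (2.31 - p1) / (1 - p1)
    return p1 * 3 / 5 + (1 - p1) * b * (2 / 3 - 4 / 5 * math.log(b))


grid = [k / 1000 for k in range(0, 1000)]     # p1 in [0, 0.999]
values = [change_bound(p) for p in grid]
print(values[0])                               # value at p1 = 0
print(all(a > b for a, b in zip(values, values[1:])))  # decreasing on the grid
```

The value at $p_1 = 0$ is only slightly below zero, which reflects how tightly the constants 2.31, 2/3 and 4/5 are chosen.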